27 research outputs found

    Hierarchical Scheduling for Multicores with Multilevel Cache Hierarchies

    Cache locality is an important consideration for performance in multicore systems. In modern and future multicore systems with multilevel cache hierarchies, caches may be arranged in a tree of caches, where a level-k cache is shared between P_k processors, called a processor group, and P_k increases with k. To get good performance, subcomputations that share more data should, as much as possible, execute on processors that share a lower-level cache. The number of cache misses in these systems therefore depends on the scheduling decisions, and a scheduler is responsible not just for achieving good load balance and low overheads, but also good cache complexity. However, these can be competing criteria. In this paper, we explore the tension between these criteria for online hierarchical schedulers. Formally, we consider a system with P processors, arranged in a multilevel hierarchy according to a hierarchy tree, where each of the P processors forms a leaf of the tree, and an internal node at level k corresponds to a processor group. In addition, we assume that computations have locality regions, which represent parallel subcomputations that share data. Each locality region has a particular level, and the scheduler must ensure that a level-k locality region is executed by processors in the same level-k processor group, since they share a level-k cache. Thus locality regions can improve cache performance. However, they may also impair load balance and increase scheduling overheads, since the scheduler must obey the restrictions posed by locality regions. In this paper, we present a framework of hierarchical computations, that is, computations with locality regions at multiple levels of nesting. We describe the hierarchical greedy scheduler, in which each locality region is scheduled using a greedy scheduler that attempts to use as many processors as possible while obeying the restrictions posed by the locality regions. We derive a recurrence for the time complexity of a region in terms of its nested regions. We also describe how a more realistic hierarchical work-stealing scheduler can achieve the same bounds, up to constant factors, for an important subclass of computations called homogeneous computations. Finally, we analyze the cache complexity of the hierarchical work-stealing scheduler for a system with a multilevel cache hierarchy.
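
    Below is a minimal C sketch of the setting this abstract describes, assuming a hypothetical 8-processor machine with group sizes P_1 = 2, P_2 = 4, P_3 = 8; the paper's recurrence is not reproduced here. The sketch only illustrates the constraint that a level-k locality region runs inside one level-k processor group, together with the classic greedy bound W/P_k + S restricted to that group.

        /* Sketch only: hierarchy-tree parameters and a greedy time bound
         * for a locality region confined to one processor group.  The
         * machine shape (LEVELS, group_size) is hypothetical. */
        #include <stdio.h>

        #define LEVELS 3
        static const int group_size[LEVELS + 1] = {1, 2, 4, 8};  /* P_0..P_3 */

        /* A level-k region may use at most the P_k processors of its
         * enclosing group, so a greedy scheduler completes a region with
         * work W and span S in at most W / P_k + S steps. */
        static double greedy_time_bound(double work, double span, int level) {
            return work / group_size[level] + span;
        }

        int main(void) {
            for (int k = 1; k <= LEVELS; k++)
                printf("level-%d region: <= %.0f steps on P_%d = %d processors\n",
                       k, greedy_time_bound(1e6, 1e3, k), k, group_size[k]);
            return 0;
        }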

    Memory-mapped transactions

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 149-154). Memory-mapped transactions combine the advantages of both memory mapping and transactions to provide a programming interface for concurrently accessing data on disk without explicit I/O or locking operations. This interface enables a programmer to design a complex serial program that accesses only main memory, and, with little to no modification, convert the program into correct code with multiple processes that can simultaneously access disk. I implemented LIBXAC, a prototype for an efficient and portable system supporting memory-mapped transactions. LIBXAC is a C library that supports atomic transactions on memory-mapped files. LIBXAC guarantees that transactions are serializable, and it uses a multiversion concurrency control algorithm to ensure that all transactions, even aborted transactions, always see a consistent view of a memory-mapped file. LIBXAC was tested on Linux, and it is portable because it is written as a user-space library and because it does not rely on special operating system support for transactions. With LIBXAC, I was easily able to convert existing serial, memory-mapped implementations of a B+-tree and a cache-oblivious B-tree into parallel versions that support concurrent searches and insertions. To test the performance of memory-mapped transactions, I ran several experiments inserting elements with random keys into the LIBXAC B+-tree and LIBXAC cache-oblivious B-tree. When a single process performed each insertion as a durable transaction, the LIBXAC search trees ran between 4% slower and 67% faster than the B-tree for Berkeley DB, a high-quality transaction system. Memory-mapped transactions have the potential to greatly simplify the programming of concurrent data structures for databases. by Jim Sukha. M.Eng.
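
    The following C sketch shows the style of programming the thesis describes: plain loads and stores on mapped data wrapped in a transaction, with a retry loop on abort. The function names xac_begin and xac_end are hypothetical stand-ins, not LIBXAC's actual API, and their bodies are single-threaded stubs so the sketch compiles and runs.

        #include <stdio.h>

        /* Hypothetical stand-ins for a LIBXAC-style interface; the real
         * library would implement these with memory mapping, page
         * protection, and multiversion concurrency control. */
        static int xac_begin(void) { return 0; }
        static int xac_end(void)   { return 0; }   /* 0 = committed */

        /* Ordinary memory operations on the mapped data become an atomic,
         * serializable transaction; on abort, the loop simply retries. */
        static void transfer(long *accounts, int from, int to, long amount) {
            do {
                xac_begin();
                accounts[from] -= amount;
                accounts[to]   += amount;
            } while (xac_end() != 0);
        }

        int main(void) {
            long accounts[2] = {100, 0};
            transfer(accounts, 0, 1, 25);
            printf("%ld %ld\n", accounts[0], accounts[1]);
            return 0;
        }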

    Composable abstractions for synchronization in dynamic threading platforms

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 259-269). High-level abstractions for parallel programming simplify the development of efficient parallel applications. In particular, composable abstractions allow programmers to construct a complex parallel application out of multiple components, where each component itself may be designed to exploit parallelism. This dissertation presents the design of three composable abstractions for synchronization in dynamic-threading platforms, based on ideas of task-graph execution, helper locks, and transactional memory. These designs demonstrate provably efficient runtime scheduling for programs with synchronization. For applications that use task-graph synchronization, I demonstrate provably efficient execution of task graphs with arbitrary dependencies as a library in a fork-join platform. Conventional wisdom suggests that a fork-join platform can execute an arbitrary task graph only with special runtime support or by converting the graph into a series-parallel computation which has less parallelism. By implementing Nabbit, a Cilk++ library for arbitrary task-graph execution, I show that one can in fact avoid introducing runtime modifications or additional constraints on parallelism. Nabbit achieves an asymptotically optimal completion-time bound for task graphs with constant degree. For applications that use lock-based synchronization, I introduce helper locks, a new synchronization abstraction that enables programmers to exploit asynchronous task parallelism inside locked critical regions. When a processor fails to acquire a helper lock, it can help complete the parallel critical region protected by the lock instead of simply waiting for the lock to be released. I also present HELPER, a runtime for supporting helper locks, and prove theoretical performance bounds which imply that HELPER achieves linear speedup on programs with a small number of highly parallel critical regions. For applications that use transaction-based synchronization, I present CWSTM, the first design of a transactional memory (TM) system that supports transactions with nested parallelism and nested parallel transactions of unbounded nesting depth. CWSTM demonstrates that one can provide theoretical bounds on the overhead of transaction conflict detection which are independent of nesting depth. I also introduce the concept of ownership-aware TM, the idea of using information about which memory locations a software module owns to provide provable guarantees of safety and correctness for open-nested transactions. by Jim Sukha. Ph.D.
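
    To make the task-graph idea concrete, here is a small C sketch of the join-counter technique that a Nabbit-style library can use: each node counts its unfinished predecessors, and whichever completion drives the count to zero executes the node. This illustrates the general technique, not Nabbit's code; in Cilk++ the newly enabled successor would be spawned rather than run recursively.

        #include <stdatomic.h>
        #include <stdio.h>

        #define MAX_SUCC 4

        struct node {
            atomic_int   pending;            /* unfinished predecessors */
            void       (*compute)(struct node *);
            struct node *succ[MAX_SUCC];     /* successors to notify    */
            int          nsucc;
        };

        static void execute(struct node *n) {
            n->compute(n);
            for (int i = 0; i < n->nsucc; i++)
                /* The last predecessor to finish enables the successor. */
                if (atomic_fetch_sub(&n->succ[i]->pending, 1) == 1)
                    execute(n->succ[i]);   /* a real runtime would spawn */
        }

        static void work(struct node *n) { (void)n; puts("node done"); }

        int main(void) {
            struct node b = { .pending = 1, .compute = work };
            struct node a = { .pending = 0, .compute = work,
                              .succ = { &b }, .nsucc = 1 };
            execute(&a);   /* runs a, then b once its dependency resolves */
            return 0;
        }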

    On-the-fly pipeline parallelism

    Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T_1 work and T_∞ span (critical-path length), Piper executes the computation on P processors in T_P ≤ T_1/P + O(T_∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its serial counterpart when running on 16 processors. National Science Foundation (U.S.) (Grant CNS-1017058); National Science Foundation (U.S.) (Grant CCF-1162148); National Science Foundation (U.S.) Graduate Research Fellowship.
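
    The C sketch below illustrates the dependency structure of an on-the-fly pipeline under simplifying assumptions (the stage-counting function is hypothetical, and which stages carry cross-iteration edges is simplified relative to the paper): node (i, j) is stage j of iteration i, with an edge from (i, j-1) within the iteration and from (i-1, j) across iterations, and the number of stages per iteration is data dependent, so the DAG emerges only at run time.

        #include <stdio.h>

        #define ITERS 4

        /* Hypothetical data-dependent stage count, unknown until
         * iteration i actually runs. */
        static int num_stages(int i) { return (i % 2 == 0) ? 3 : 2; }

        int main(void) {
            for (int i = 0; i < ITERS; i++) {
                for (int j = 0; j < num_stages(i); j++) {
                    printf("(%d,%d) waits on:", i, j);
                    if (j > 0)
                        printf(" (%d,%d)", i, j - 1);     /* same iteration */
                    if (i > 0 && j < num_stages(i - 1))
                        printf(" (%d,%d)", i - 1, j);     /* cross edge     */
                    printf("\n");
                }
            }
            return 0;
        }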

    Safe Open-Nested Transactions Through Ownership

    Researchers in transactional memory (TM) have proposed open nesting as a methodology for increasing the concurrency of a program. The idea is to ignore certain "low-level" memory operations of an open-nested transaction when detecting conflicts for its parent transaction, and instead perform abstract concurrency control for the "high-level" operation that the nested transaction represents. To support this methodology, TM systems use an open-nested commit mechanism that commits all changes performed by an open-nested transaction directly to memory, thereby avoiding low-level conflicts. Unfortunately, because the TM runtime is unaware of the different levels of memory, an unconstrained use of open-nested commits can lead to anomalous program behavior. In this paper, we describe a framework of ownership-aware transactional memory which incorporates the notion of modules into the TM system and requires that transactions and data be associated with specific transactional modules, or Xmodules. We propose a new ownership-aware commit mechanism, a hybrid between an open-nested and a closed-nested commit, which commits a piece of data differently depending on whether the current Xmodule owns the data or not. Moreover, we give a set of precise constraints on interactions and sharing of data among the Xmodules based on familiar notions of abstraction. We prove that ownership-aware TM has clean memory-level semantics and can guarantee serializability by modules, which is an adaptation of multilevel serializability from databases to TM. In addition, we describe how a programmer can specify Xmodules and ownership in a Java-like language. Our type system can enforce most of the constraints required by ownership-aware TM statically, and can enforce the remaining constraints dynamically. Finally, we prove that if transactions in the process of aborting obey restrictions on their memory footprint, the OAT model is free from semantic deadlock.
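
    A minimal C sketch of the hybrid commit rule described above, assuming the write log records which Xmodule owns each location (the types and names here are illustrative, not the paper's system): data owned by the committing Xmodule is committed to memory, open-nested style, while data owned by another Xmodule is merged into the parent transaction, closed-nested style.

        #include <stdio.h>

        enum owner { XMODULE_SELF, XMODULE_OTHER };

        struct write_entry {
            const char *loc;        /* name of the written location */
            enum owner  owned_by;   /* does this Xmodule own it?    */
        };

        static void ownership_aware_commit(const struct write_entry *w, int n) {
            for (int i = 0; i < n; i++) {
                if (w[i].owned_by == XMODULE_SELF)
                    printf("%s: commit to memory (open-nested)\n", w[i].loc);
                else
                    printf("%s: merge into parent (closed-nested)\n", w[i].loc);
            }
        }

        int main(void) {
            const struct write_entry log[] = {
                { "btree.root",  XMODULE_SELF  },  /* owned by this Xmodule  */
                { "app.counter", XMODULE_OTHER },  /* owned by caller module */
            };
            ownership_aware_commit(log, 2);
            return 0;
        }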

    Whole-genome sequencing reveals host factors underlying critical COVID-19

    Critical COVID-19 is caused by immune-mediated inflammatory lung injury. Host genetic variation influences the development of illness requiring critical care [1] or hospitalization [2-4] after infection with SARS-CoV-2. The GenOMICC (Genetics of Mortality in Critical Care) study enables the comparison of genomes from individuals who are critically ill with those of population controls to find underlying disease mechanisms. Here we use whole-genome sequencing in 7,491 critically ill individuals compared with 48,400 controls to discover and replicate 23 independent variants that significantly predispose to critical COVID-19. We identify 16 new independent associations, including variants within genes that are involved in interferon signalling (IL10RB and PLSCR1), leucocyte differentiation (BCL11A) and blood-type antigen secretor status (FUT2). Using transcriptome-wide association and colocalization to infer the effect of gene expression on disease severity, we find evidence that implicates multiple genes, including reduced expression of a membrane flippase (ATP11A) and increased expression of a mucin (MUC1), in critical disease. Mendelian randomization provides evidence in support of causal roles for myeloid cell adhesion molecules (SELE, ICAM5 and CD209) and the coagulation factor F8, all of which are potentially druggable targets. Our results are broadly consistent with a multi-component model of COVID-19 pathophysiology, in which at least two distinct mechanisms can predispose to life-threatening disease: failure to control viral replication, or an enhanced tendency towards pulmonary inflammation and intravascular coagulation. We show that comparison between cases of critical illness and population controls is highly efficient for the detection of therapeutically relevant mechanisms of disease.
